10 research outputs found
Dimensionality Reduction using PCA and K-Means Clustering for Breast Cancer Prediction
Breast cancer is a leading cause of death among women. Predicting breast cancer at an early stage increases the chance of a cure. This requires a prediction tool that can classify a breast tumor as either a harmful malignant tumor or a harmless benign tumor. In this paper, two machine learning algorithms, Support Vector Machine and Extreme Gradient Boosting, are compared for classification. Prior to classification, the number of attributes in the raw data is reduced by extracting features with Principal Component Analysis. A clustering method, K-Means, is also used for dimensionality reduction alongside Principal Component Analysis. The paper compares four models, formed by combining the two dimensionality reduction methods with the two classifiers, applied to the Wisconsin Breast Cancer Dataset. The comparison uses accuracy, sensitivity, and specificity computed from the confusion matrices. The experimental results indicate that K-Means, which is not usually used for dimensionality reduction, can perform well compared to the popular Principal Component Analysis.
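The three metrics named in the abstract follow directly from the four cells of a binary confusion matrix. The sketch below is illustrative only: the function name `confusion_metrics`, the choice of malignant as the positive class, and the counts are assumptions, not values from the paper.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts.

    Assumes malignant is the positive class:
      tp = malignant predicted malignant, fn = malignant predicted benign,
      fp = benign predicted malignant,    tn = benign predicted benign.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # share of all tumors classified correctly
        "sensitivity": tp / (tp + fn),      # share of malignant tumors that were caught
        "specificity": tn / (tn + fp),      # share of benign tumors correctly cleared
    }


# Hypothetical counts for illustration; not results from the paper.
metrics = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

Reporting sensitivity and specificity next to accuracy matters in this setting because a missed malignant tumor and a false alarm on a benign one carry very different costs.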
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken.
WEIRD FAccTs: How Western, Educated, Industrialized, Rich, and Democratic is FAccT?
Studies conducted on Western, Educated, Industrialized, Rich, and Democratic
(WEIRD) samples are considered atypical of the world's population and may not
accurately represent human behavior. In this study, we aim to quantify the
extent to which the ACM FAccT conference, the leading venue in exploring
Artificial Intelligence (AI) systems' fairness, accountability, and
transparency, relies on WEIRD samples. We collected and analyzed 128 papers
published between 2018 and 2022, accounting for 30.8% of the overall
proceedings published at FAccT in those years (excluding abstracts, tutorials,
and papers without human-subject studies or clear country attribution for the
participants). We found that 84% of the analyzed papers were exclusively based
on participants from Western countries, and most of these exclusively from the U.S.
(63%). Only researchers who undertook the effort to collect data about local
participants through interviews or surveys added diversity to an otherwise
U.S.-centric view of science. Therefore, we suggest that researchers collect
data from under-represented populations to obtain an inclusive worldview. To
achieve this goal, scientific communities should champion data collection from
such populations and enforce transparent reporting of data biases. Comment: To appear at ACM FAccT 202
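The percentages in the abstract can be turned back into approximate paper counts. This is a rough sketch: it assumes both the 84% and the 63% figures are shares of the 128 analyzed papers, which the abstract does not state unambiguously, and because the published percentages are rounded the counts are estimates only.

```python
analyzed_papers = 128      # papers analyzed, as stated in the abstract
share_of_proceedings = 0.308  # 30.8% of the 2018-2022 proceedings

# Implied size of the overall proceedings (approximate, since 30.8% is rounded).
estimated_total = analyzed_papers / share_of_proceedings

# Assumption: both percentages refer to the 128 analyzed papers.
western_only = round(0.84 * analyzed_papers)
us_only = round(0.63 * analyzed_papers)

print(f"Estimated proceedings size: about {estimated_total:.0f} papers")
print(f"Exclusively Western participants: about {western_only} papers")
print(f"Exclusively U.S. participants: about {us_only} papers")
```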
Embryo Grading after In Vitro Fertilization using YOLO
In vitro fertilization is an application of Assisted Reproductive Technology. This technology produces embryos outside the mother's womb by manipulating gametes outside the human body. The success rate of in vitro fertilization depends on selecting good-grade embryos. In this study, the authors used YOLO version 3 to perform object detection, assigning a grade to each embryo image objectively. The authors use embryo images sourced from the Indonesian Medical Education and Research Institute, annotated with information on embryo quality. The data were separated under two schemes. The first scheme splits the data into 70% training, 15% validation, and 15% testing data. The second scheme uses Stratified K-Fold Cross-Validation with 3 folds. For training, the authors configure Max Batches=6000, Steps=4800,5400, Batch=64, and Subdivision=16, with image augmentation (saturation=1.5, exposure=1.5, hue=0.1, jitter=0.3, random=1). The mAP (Mean Average Precision) obtained for the first data separation scheme is 100.00% at the 6000th iteration, while for the second scheme the highest mAP is 97.33%, in fold 3 at the 5000th iteration. This indicates that both separation schemes are sufficient in terms of mAP.
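The second data separation scheme can be illustrated with a minimal pure-Python sketch of stratified 3-fold splitting. The function `stratified_kfold`, the grade names, and the sample counts are hypothetical and chosen for illustration; the abstract does not describe the authors' actual tooling or class distribution.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=3, seed=0):
    """Return k (train_indices, test_indices) pairs with similar class ratios per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into the folds
    # so that every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the remaining folds form the training set.
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Hypothetical grade labels for 12 embryo images (not the paper's data).
grades = ["good"] * 6 + ["fair"] * 3 + ["poor"] * 3
splits = stratified_kfold(grades, k=3)

for fold_number, (train, test) in enumerate(splits, start=1):
    test_grades = sorted(grades[i] for i in test)
    print(f"fold {fold_number}: {len(train)} train, {len(test)} test, test grades = {test_grades}")
```

Stratifying keeps the proportion of each grade roughly constant across folds, which helps when some grades are much rarer than others.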